Machine learning-powered estimation of malachite green photocatalytic degradation with NML-BiFeO3 composites

This study explores photocatalytic degradation with novel NML-BiFeO3 (noble metal-incorporated bismuth ferrite) compounds for eliminating malachite green (MG) dye from wastewater, and investigates how well various Gaussian process regression (GPR) models predict MG degradation. Four GPR models (Matern, Exponential, Squared Exponential, and Rational Quadratic) were applied to a dataset of 1200 observations covering a range of experimental conditions. The models considered ten input variables, including catalyst properties, solution characteristics, and operational parameters. The Exponential kernel-based GPR model achieved the best performance, with a near-perfect R² value of 1.0, indicating exceptional accuracy in predicting MG degradation. Sensitivity analysis revealed process time as the most critical factor influencing MG degradation, followed by pore volume, catalyst loading, light intensity, catalyst type, pH, anion type, surface area, and humic acid concentration, which highlights the complex interplay between these factors in the degradation process. The reliability of the models was confirmed by outlier detection using the Williams plot, which flagged only a small number of outliers (66–71 data points depending on the model), indicating the robustness of the data used for model development. The study suggests that NML-BiFeO3 composites hold promise for wastewater treatment and that GPR models, particularly Matern-GPR, offer a powerful tool for predicting MG degradation. Identifying fundamental catalyst properties can expedite the application of NML-BiFeO3, leading to optimized wastewater treatment processes. Overall, this study provides valuable insights into using NML-BiFeO3 compounds and machine learning for efficient MG removal from wastewater.

The methodology used in this study for modelling and optimizing MG dye photocatalysis with NML-BiFeO3 compounds is depicted in Fig. 1 and draws on insights from our previous research. As Fig. 1 illustrates, the study proceeds in three distinct stages. The first stage involves selecting ten parameters that significantly influence degradation efficiency, followed by designing and collecting 1200 data points through experimentation. In the second stage, four kernel functions in the GPR model are compared extensively to identify the configuration that most accurately predicts the efficiency of MG dye elimination during the photocatalytic procedure. In the third stage, the photocatalytic behaviour is characterized using the four models, which exhibit superior performance with higher R-squared values and lower error rates.

Data preparation
We used 1200 data points in this investigation, obtained from a previous photocatalytic study 64. Table S1 lists the suppliers, concentrations, chemical formulas, labels, and intended applications of all chemicals used in the present investigation.
This study included 10 distinct features, selected for their relevance and potential influence on the photocatalytic process under examination. The features were the type of catalyst employed, the duration of the experiment (in minutes), the surface area of the catalyst material (in m²/g), the pore volume of the material (in cm³/g), the intensity of the illumination (in watts), the quantity of catalyst loaded into the system (in g/L), the in-solution MG dye concentration (in mg/L), the pH of the solution, the concentration of humic acid (in mg/L), and the presence or absence of specific anions. The output variable was the efficacy of MG dye degradation.
During data preparation, particular attention was paid to the two categorical input variables: the anion type and the catalyst type. We employed a new strategy to convert these attributes into numerical representations. Catalyst types were characterized by a linear combination of the normalized surface area and pore volume of the catalysts, and anion types by the normalized molecular weight of each anion. The normalization was carried out within the range of 0 to 1.
This conversion was essential to ensure that the data met the numerical requirements of ML algorithms. Before model construction, the dataset was randomly partitioned into two subsets: 75% was used as the training dataset and the remaining 25% as the test dataset. This division ensured that the models were assessed on unseen data after training, gauging their ability to generalize beyond the training phase.
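The preparation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the weights of the linear combination are not stated in the text, so equal weights are assumed, and the catalyst values, function names, and random seed are all hypothetical.

```python
import random

def min_max(values):
    """Scale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def encode_catalysts(surface_area, pore_volume, w_area=0.5, w_pore=0.5):
    """Map each catalyst type to one number: a weighted sum of its normalized
    surface area and pore volume. The equal weights are an assumption."""
    sa = min_max(surface_area)
    pv = min_max(pore_volume)
    return [w_area * a + w_pore * p for a, p in zip(sa, pv)]

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Randomly partition rows into a training set and a test set."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in idx[:n_test]]
    train = [rows[i] for i in idx[n_test:]]
    return train, test

# Hypothetical surface areas (m2/g) and pore volumes (cm3/g) for three catalysts.
codes = encode_catalysts([10.0, 25.0, 40.0], [0.02, 0.05, 0.08])

# 1200 placeholder records split 75% / 25%.
train, test = train_test_split(list(range(1200)))
print(codes, len(train), len(test))
```

The anion encoding follows the same pattern: `min_max` applied to the molecular weights of the anions.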

Gaussian process regression (GPR)
The GPR model is a powerful and well-structured machine learning approach, well-regarded for its probabilistic and nonparametric characteristics. It can handle complex problems that involve non-linear relationships 65.
A key feature of this approach is the use of Gaussian processes for regression tasks. Much of its appeal arises from its capacity to incorporate uncertainty efficiently within its computational framework 53.
In GPR modelling, it is conventional to use two separate datasets: one allocated for training (L) and another for testing (T). These datasets are selected at random and comprise the sets $L = \{(x_{L,i}, y_{L,i})\}$ and $T = \{(x_{T,i}, y_{T,i})\}$, where x denotes the input parameters and y the associated outputs. The following equation establishes the basis of GPR modelling:

$$y_L = f(x_L) + \varepsilon_L, \qquad \varepsilon_L \sim \mathcal{N}\left(0, \sigma_{noise}^2 I_n\right)$$

Here, $x_L$ denotes the inputs and $y_L$ the outputs of the training dataset, $\varepsilon$ is the observation noise, $\sigma_{noise}^2$ is the noise variance, and $I_n$ is the identity matrix. In the same vein, for the test dataset:

$$y_T = f(x_T) + \varepsilon_T$$

The symbols retain their previously defined meanings, but here they pertain to the test dataset. Consequently, the Gaussian noise model links each observed y value to the corresponding function value f(x). In the GPR paradigm, f(x) is a stochastic function characterized jointly by the mean function m(x) and the covariance function k(x, x′), commonly known as the kernel function.
The mean function m(x) can be specified using basis functions; nonetheless, it is commonly set to zero for simplicity and computational convenience 66.
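From the definitions above, the following can be deduced:

$$y_L \sim \mathcal{N}\left(0, K(x_L, x_L) + \sigma_{noise}^2 I\right), \qquad y_T \sim \mathcal{N}\left(0, K(x_T, x_T) + \sigma_{noise}^2 I\right)$$

where K denotes the matrix of kernel values k(x, x′) and I is the identity matrix of matching size. Combining these two equations gives the joint Gaussian expression:

$$\begin{bmatrix} y_L \\ y_T \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(x_L, x_L) + \sigma_{noise}^2 I & K(x_L, x_T) \\ K(x_T, x_L) & K(x_T, x_T) + \sigma_{noise}^2 I \end{bmatrix}\right)$$

By applying the Gaussian conditioning principle, we obtain the distribution of $y_T$:

$$y_T \mid y_L \sim \mathcal{N}(\mu_T, \Sigma_T)$$

$$\mu_T = K(x_T, x_L)\left[K(x_L, x_L) + \sigma_{noise}^2 I\right]^{-1} y_L$$

$$\Sigma_T = K(x_T, x_T) + \sigma_{noise}^2 I - K(x_T, x_L)\left[K(x_L, x_L) + \sigma_{noise}^2 I\right]^{-1} K(x_L, x_T)$$

Here, $\Sigma_T$ represents the covariance and $\mu_T$ the mean value. The predictive power and robustness of a GPR model are influenced by the kernel function, which has a symmetric, non-singular form. Four options (squared exponential, exponential, Matern, and rational quadratic) were examined to identify the best-suited kernel function.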
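As an illustration of the conditioning step, the following sketch computes the posterior mean and covariance of a zero-mean GPR with NumPy. It is a hedged example rather than the MATLAB workflow used in the study: the exponential kernel is written in a commonly used form, and the hyperparameter values, noise variance, function names, and synthetic data are all assumptions made for demonstration.

```python
import numpy as np

def exponential_kernel(A, B, length_scale=1.0, variance=1.0):
    """Assumed form: k(x, x') = variance * exp(-||x - x'|| / length_scale)."""
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return variance * np.exp(-dist / length_scale)

def gpr_predict(x_train, y_train, x_test, kernel, noise_var=1e-6):
    """Posterior mean and covariance of y_T given y_L under a zero-mean GP."""
    K_LL = kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_TL = kernel(x_test, x_train)
    K_TT = kernel(x_test, x_test) + noise_var * np.eye(len(x_test))
    # A Cholesky factorization avoids forming the explicit inverse of K_LL.
    chol = np.linalg.cholesky(K_LL)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mu_T = K_TL @ alpha
    v = np.linalg.solve(chol, K_TL.T)
    Sigma_T = K_TT - v.T @ v
    return mu_T, Sigma_T

# Synthetic stand-in data: 30 points with 2 inputs (not the study's dataset).
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=(30, 2))
y_train = np.sin(3.0 * x_train[:, 0]) + x_train[:, 1]

x_new = rng.uniform(0.0, 1.0, size=(5, 2))
mu_new, Sigma_new = gpr_predict(x_train, y_train, x_new, exponential_kernel)
mu_at_train, _ = gpr_predict(x_train, y_train, x_train, exponential_kernel)
print(mu_new)
```

With a very small noise variance, the posterior mean evaluated at the training inputs nearly reproduces the training targets, so goodness of fit is more informative when measured on held-out points.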

Performance metrics
The performance of the established models depended on the data quality and input factors. To measure model performance, a set of statistical measures was employed: the coefficient of determination (R²), root-mean-square error (RMSE), and mean absolute error (MAE). In the defining equations of these measures, n is the total number of samples considered, o_i represents the observed removal efficiencies, p_i stands for the calculated removal efficiencies, and p̄ is the mean of all predicted efficiencies.
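For readers who want to reproduce such metrics, a minimal Python sketch follows. It uses the conventional textbook definitions, which is an assumption: in particular, R² is computed here against the mean of the observed values, and the exact form used in the study may differ. The sample numbers are hypothetical.

```python
import math

def regression_metrics(observed, predicted):
    """Return R2, RMSE and MAE using conventional definitions (assumed)."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    mae = sum(abs(r) for r in residuals) / n
    rmse = math.sqrt(sum(r * r for r in residuals) / n)
    mean_obs = sum(observed) / n
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot
    return {"R2": r2, "RMSE": rmse, "MAE": mae}

# Hypothetical observed and predicted removal efficiencies (%).
observed = [10.0, 40.0, 70.0, 95.0]
predicted = [12.0, 38.0, 71.0, 94.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```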

Model development and testing
This study employed MATLAB software version 2018 to develop GPR models for predicting MG dye photocatalytic degradation. Table 1 compares our findings with previous research on organic pollutant degradation. The GPR models developed here achieved higher R-squared values and lower MAE and RMSE values than a significant portion of the existing literature. High R-squared values indicate strong agreement between predicted and experimental degradation values, validating the effectiveness of the models.
We examined error characteristics (STD, RMSE, MSE, MRE) to assess the training performance of the recommended GPR models. The error metrics indicate that the models effectively captured patterns and trends in the training data. Notably, the GPR model with an exponential kernel demonstrated excellent accuracy in predicting MG dye degradation for unseen data. Its high R-squared value (1.0) and low error metrics highlight its superior predictive capabilities.
This performance suggests that the model handles the complexities of MG dye photocatalytic degradation in wastewater effectively, with potential applications in carbon capture and utilization. The multifaceted experimental design and the diverse input features add to the comprehensiveness of this study and support a more meaningful understanding of the underlying phenomena. Consequently, the GPR model emerges as a reliable and suitable solution for addressing real-world challenges in this domain.
The accuracy of the developed models is further validated by plotting the predicted and experimental values for the photocatalytic degradation of the MG dye together in Fig. 2. The experimental MG dye degradation values align closely with the predictions of the GPR models, which demonstrates the models' ability to predict MG dye photocatalytic degradation with NML-BiFeO3 composites. This close agreement between predicted and observed values shows that the models capture the photocatalytic degradation behaviour, which is relevant to wastewater treatment. It also gives researchers more confidence in using these models to predict MG dye removal efficiency and to improve processes linked to photocatalytic degradation.
Figure 3 illustrates the prediction accuracy of the GPR models for MG dye photocatalytic degradation against the data collected from experiments. The graph shows a strong correlation, approaching 1.000, between the predicted and experimental outcomes.
The close alignment of the fitted lines with the 45° line indicates the models' accuracy in capturing complicated degradation trends. The points lie precisely along this line, especially for the GPR model using the Matern kernel function, which achieves a correlation value of 1. The graph is an essential tool for evaluating the accuracy of GPR models in forecasting the photocatalytic degradation of MG dye with the NML-BiFeO3 composite. Such knowledge of model accuracy helps improve wastewater treatment technologies and informs choices in academic and commercial contexts. The accuracy shown by the Matern kernel-equipped GPR model distinguishes it as a noteworthy instrument for forecasting MG dye photocatalytic degradation.
Figure 4 conveys crucial information about the predictive efficacy of the GPR models for MG dye photocatalytic degradation. The figure displays the differences between the experimentally measured MG dye photocatalytic degradation values and the corresponding values estimated by the GPR models, which allows the accuracy of the different models to be compared. For the Rational Quadratic and Squared Exponential kernel functions, the relative deviation points are reported to be below 30%, demonstrating a tight correlation between predicted and experimental results. The relative deviation points of the Exponential kernel function are less than 1%, while the Matern kernel function stands out for its superior accuracy, with absolute deviation points below 0.1%. This suggests high precision in capturing the underlying behavior of photocatalytic degradation and supports the credibility of the GPR models, especially those using specific kernel functions, for predicting MG dye photocatalytic degradation with the NML-BiFeO3 composite. These findings could help researchers choose the most appropriate GPR models for different purposes, particularly in wastewater treatment and applied research, with the overall aim of contributing to sustainable solutions by improving the understanding and prediction of dye pollutant emissions. Figure 5 compares the current four GPR models with the models developed by Jaffari et al. 64 to estimate the efficiency of MG photocatalytic degradation with NML-BiFeO3 composites. As can be seen, the current models achieve higher accuracy than the literature models, as evidenced by lower errors and higher R-squared values.

Sensitivity analysis
Sensitivity analysis is conventionally carried out to explore the impact of input factors on the output quantity 67. For this analysis, we consider the relevance factor r, which serves as the primary indicator of which input parameter has the most significant impact on MG photocatalytic degradation with NML-BiFeO3 composites. It is quantified by the following equation:

$$r = \frac{\sum_{i=1}^{n}\left(X_{k,i} - \bar{X}_k\right)\left(Y_i - \bar{Y}\right)}{\sqrt{\sum_{i=1}^{n}\left(X_{k,i} - \bar{X}_k\right)^2 \sum_{i=1}^{n}\left(Y_i - \bar{Y}\right)^2}}$$

where $X_{k,i}$ is the ith value of the kth input parameter, $\bar{X}_k$ is the average value of that input parameter, $Y_i$ is the ith output, $\bar{Y}$ is the average of the outputs, and n is the total number of data points included in the analysis. The r value varies within the range of −1 to +1. The absolute value of r measures how strongly each input variable affects the output variable: a higher absolute value signifies a more pronounced correlation between that input and the output. Negative values indicate that higher input values correspond to lower output values, while positive values indicate that higher input values are associated with higher output values 68. The results are presented in Fig. 6. The sensitivity study illuminates the complex interplay between input parameters and MG photocatalytic degradation and identifies the crucial factors that contribute to the process.
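The relevance factor described above is the Pearson correlation coefficient between one input and the output, and can be computed as in the following sketch. The function name and the sample values are hypothetical and serve only to show a positive and a negative case.

```python
import math

def relevance_factor(x, y):
    """Pearson correlation between one input column x and the output y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: degradation rising with process time.
time_min = [0.0, 30.0, 60.0, 90.0, 120.0]
degradation = [5.0, 40.0, 65.0, 85.0, 95.0]
r_time = relevance_factor(time_min, degradation)

# Hypothetical data: an output that falls linearly as the input rises.
r_negative = relevance_factor([1.0, 2.0, 3.0], [6.0, 4.0, 2.0])
print(r_time, r_negative)
```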
Analyzing feature significance with the GPR model allows us to understand the impact of operational factors on the estimated photodegradation of MG dye. Our investigation focused on how the various input features influenced the GPR model's overall accuracy. Figure 6 presents the resulting assessment of the relative importance of these input features. The most important factor is the duration of the photocatalytic process. Pore volume and catalyst loading contribute 20% each, followed by light intensity at 19%. Catalyst type contributes 18%, followed by the pH of the solution at 16%. Anion type contributes 12%, surface area contributes 14%, and humic acid concentration contributes 4%. Notably, the gap in relative significance between the most critical factor, time, and the least significant factor, humic acid concentration, exceeds 80%. The degradation of MG dye was thus markedly affected by the input factors linked to the conditions of the photocatalytic process, as illustrated in the inset of Fig. 6. The GPR model was examined further through a permutation significance assessment. This method measures the decrease in model effectiveness that results from randomly reshuffling an individual feature 69. Reshuffling breaks the link between that input attribute and the MG dye degradation efficiency, which lowers the model's performance and thereby reveals how strongly the model depends on that attribute.
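The permutation assessment can be sketched as follows. This is a generic illustration, not the study's implementation: `toy_model` is a hypothetical stand-in for a fitted GPR predictor, the data are synthetic, and the increase in mean squared error is assumed as the performance measure.

```python
import random

def mse(model, X, y):
    """Mean squared error of model predictions against targets."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=20, seed=0):
    """Average increase in MSE when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the dataset with only feature j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical predictor: strong on feature 0, weak on 1, ignores 2."""
    return 5.0 * row[0] + 0.5 * row[1]

data_rng = random.Random(1)
X = [[data_rng.random() for _ in range(3)] for _ in range(200)]
y = [toy_model(row) for row in X]
scores = permutation_importance(toy_model, X, y)
print(scores)
```

A feature the model does not use produces no increase in error when shuffled, while the feature the model relies on most produces the largest increase.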

Outlier detection
Outliers, or suspect data points, behave differently from the rest of the data, a disparity frequently attributed to experimental irregularities or instrumental inaccuracies. To improve the model and prevent erroneous analysis, it is imperative to identify and address potentially problematic data within the dataset. For this purpose, we employ the leverage method, in which the Hat matrix is computed from U, a matrix of size i × j, where i denotes the number of parameters and j the number of training data points. A Williams plot is then produced to evaluate the reliability of the data. It plots standardized residuals against Hat values; any data point falling outside a designated region is considered potentially questionable. This reliable zone is bounded by standardized residuals between −3 and 3 and by Hat values between 0 and the critical leverage limit, which is calculated following refs. 70,71. The Williams plot of the MG photocatalytic degradation data bank (Fig. 7) shows that a significant portion of the data used in the analysis is reliable. Specifically, out of a total of 1200 data points, only 71, 68, 69, and 66 outliers were identified for the GPR-Rational Quadratic, GPR-Squared Exponential, GPR-Exponential, and GPR-Matern models, respectively.
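A hedged sketch of the leverage calculation follows. It assumes the definitions commonly used with Williams plots, namely the Hat matrix H = X (XᵀX)⁻¹ Xᵀ and the critical leverage H* = 3(p + 1)/n; the exact expressions used in the study are not reproduced here. The matrix is arranged with one row per data point (the transpose of the i × j layout described above), residuals are standardized by their sample standard deviation, and all data are synthetic.

```python
import numpy as np

def williams_flags(X, residuals, limit=3.0):
    """Leverages, critical leverage, standardized residuals, outlier mask.

    X has shape (n_points, n_params); formulas are the commonly used ones
    and are assumed, not taken from the study.
    """
    n, p = X.shape
    hat = X @ np.linalg.pinv(X.T @ X) @ X.T
    leverage = np.diag(hat)
    h_star = 3.0 * (p + 1) / n
    std_res = (residuals - residuals.mean()) / residuals.std(ddof=1)
    outlier = (leverage > h_star) | (np.abs(std_res) > limit)
    return leverage, h_star, std_res, outlier

# Synthetic example: 100 points, 3 inputs, one deliberately bad residual.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
residuals = 0.1 * rng.normal(size=100)
residuals[0] = 5.0

leverage, h_star, std_res, outlier = williams_flags(X, residuals)
print(h_star, int(outlier.sum()))
```

Points with leverage above `h_star` lie outside the applicability domain of the inputs, while points with large standardized residuals are poorly predicted; the Williams plot shows both criteria at once.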

Implications and drawbacks of the current study
NML-BiFeO3 composites show considerable promise as a viable option for catalyzing the degradation of organic contaminants in aqueous environments. Conventionally, the correlation between degradation effectiveness and reaction settings is established through experimental measurements with controlled variables. However, such hands-on experiments are often costly and time-consuming, and they struggle to gain broad adoption. This study employed four proficient ML models to describe the performance of MG dye photodegradation, which highlights a notable potential for promptly forecasting empirical outcomes from predetermined settings. The study also identified the key attributes of a photocatalyst's surface characteristics and assessed their influence on the material's effectiveness in degrading organic pollutants and in facilitating selective conditions during photocatalytic reactions for treating organic wastewater. Applying this method will substantially reduce the need for extensive experimental exploration, resulting in cost savings and expediting the use of NML-BiFeO3 compounds in organic wastewater treatment. The current investigation underscores ML as a promising avenue for forecasting NML-BiFeO3-assisted photodegradation of MG dye compounds under controlled parameters. However, it is important to acknowledge limitations. Photocatalytic performance can be significantly influenced by various other factors, including temperature, pore volume, and catalyst loading. Additionally, this study does not account for the presence of multiple organic contaminants in a real-world wastewater treatment scenario. Fluctuations in these parameters could introduce discrepancies in the model, modify the significance of features, and limit the model's generalizability, because experimental data for these conditions are absent. Future research will prioritize understanding the influence of these variables on the NML-BiFeO3 photocatalytic process. The model will be refined further by incorporating additional data to enhance its precision and broaden its applicability to a wider range of organic pollutants. Because different organic pollutants may behave differently within photocatalytic systems, further exploration using readily available datasets and a comprehensive investigation of these variables' influence on the photocatalytic breakdown of various organic pollutants in wastewater is warranted.

Conclusions
In this study, we investigated the potential of various Gaussian process regression (GPR) models for predicting malachite green (MG) dye degradation using noble metal-incorporated bismuth ferrite (NML-BiFeO3) photocatalysts. The GPR models significantly outperformed existing methods in predicting MG degradation efficacy, as shown by the high R² values and low error metrics. The exponential kernel-based GPR model demonstrated the best performance, with a near-perfect R² value of 1.0 and minimal errors, which establishes its suitability for forecasting MG photocatalytic degradation in wastewater treatment. The close alignment between predicted and experimental results underscores the reliability of the GPR models in estimating degradation rates. This precision strengthens the foundation for using GPR models to guide decision-making and optimize processes related to MG dye degradation.
Notably, the Rational Quadratic and Squared Exponential kernel models exhibited deviations below 30%. The Exponential kernel achieved less than 1% deviation, while the Matern kernel surpassed all others with a deviation of less than 0.1%. These findings highlight the accuracy of these models, particularly those employing specific kernels, for predicting MG dye degradation using NML-BiFeO3 photocatalysts. These insights help researchers select the most appropriate GPR models for wastewater treatment applications, ultimately contributing to advancements in sustainability efforts. Furthermore, the study identified crucial input factors influencing MG photocatalytic degradation through a comprehensive sensitivity analysis. The direct correlation between the input parameters and the degradation process reveals the complex interplay between these factors. Analyzing feature significance using the GPR model revealed that process time is the most influential factor, followed by pore volume, catalyst loading, light intensity, catalyst type, pH, anion type, surface area, and humic acid concentration.
The reliability of the data employed in the analysis is further supported by the Williams plot. Only a small portion of the 1200 data points (66 to 71 data points, depending on the GPR model) were identified as outliers, which signifies the robustness of the data employed for model development.
In conclusion, this study demonstrates the promising potential of NML-BiFeO3 composites for catalyzing the degradation of organic contaminants in wastewater. Using GPR models to forecast MG dye photodegradation offers a powerful tool for rapid and efficient prediction of empirical outcomes. Identifying key catalyst surface properties can significantly expedite the application of NML-BiFeO3 in organic wastewater treatment, leading to reduced costs and streamlined experimental procedures. Future research should explore the incorporation of additional variables to further enhance model accuracy and broaden applicability.

Figure 4. A comparison of the prediction performance of GPR models using (a) Exponential, (b) Matern, (c) Squared exponential, and (d) Rational quadratic versus empirical information.

Figure 5. Statistical comparison of the current GPR models with the Jaffari et al. 64 models.
Within this context, the parameters ℓ, σ², σ, and α correspond to the length scale, variance, amplitude, and scale mixture, respectively. Furthermore, the symbols v, Γ, and K_v signify a positive parameter, the gamma function, and the modified Bessel function, respectively.
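For reference, the four kernels are commonly written as follows in terms of these symbols, with r = ‖x − x′‖ denoting the distance between two input points. These are standard textbook parameterizations and are given here as an assumption; the exact forms used in the study may differ, for example in where the amplitude σ enters.

```latex
\begin{align}
k_{\mathrm{RQ}}(x, x') &= \sigma^2 \left(1 + \frac{r^2}{2\alpha\ell^2}\right)^{-\alpha} \\
k_{\mathrm{Matern}}(x, x') &= \sigma^2 \, \frac{2^{1-v}}{\Gamma(v)}
    \left(\frac{\sqrt{2v}\, r}{\ell}\right)^{v}
    K_v\!\left(\frac{\sqrt{2v}\, r}{\ell}\right) \\
k_{\mathrm{SE}}(x, x') &= \sigma^2 \exp\!\left(-\frac{r^2}{2\ell^2}\right) \\
k_{\mathrm{Exp}}(x, x') &= \sigma^2 \exp\!\left(-\frac{r}{\ell}\right)
\end{align}
```

In these forms the exponential kernel coincides with the Matern kernel at v = 1/2, and the squared exponential kernel is the limit of the Matern kernel as v grows large.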